Introduction

Chapter link: https://otexts.com/fpp3/graphics.html

The first thing to do in any data analysis task is to plot the data. Graphs enable many features of the data to be visualised, including patterns, unusual observations, changes over time, and relationships between variables. The features that are seen in plots of the data must then be incorporated, as much as possible, into the forecasting methods to be used. Just as the type of data determines what forecasting method to use, it also determines what graphs are appropriate. But before we produce graphs, we need to set up our time series in R.

2.1 tsibble objects

A time series can be thought of as a list of numbers (the measurements), along with some information about what times those numbers were recorded (the index). This information can be stored as a tsibble object in R.

library(tsibble)
library(tidyverse)
library(tsibbledata)
library(feasts)  # NB need this or get error: the functions for graphics are in the feasts package. So just load the feasts package as well and you can autoplot() a tsibble object.

#setwd("~/Desktop/Data_Science /DS_tutorials/Hyndman_Forecasting_book")
y <- tsibble(Year = 2015:2019, Observation = c(123,39,78,52,110), index = Year)
y
#> # A tsibble: 5 x 2 [1Y]
#>    Year Observation
#>   <int>       <dbl>
#> 1  2015         123
#> 2  2016          39
#> 3  2017          78
#> 4  2018          52
#> 5  2019         110

tsibble objects extend tidy data frames (tibble objects) by introducing temporal structure.

For observations that are more frequent than once per year, we need to use a time class function on the index. For example, suppose we have a monthly dataset z:

z <- tibble(Month = c("2019 Jan", "2019 Feb", "2019 Mar", "2019 Apr"), Observation = c(50, 23, 34, 30))
z
#> # A tibble: 4 x 2
#>   Month    Observation
#>   <chr>          <dbl>
#> 1 2019 Jan          50
#> 2 2019 Feb          23
#> 3 2019 Mar          34
#> 4 2019 Apr          30

This can be converted to a tsibble object using the yearmonth() function:

z2  <- z %>% 
  mutate(Month = yearmonth(Month)) %>% 
  as_tsibble(index = Month)

z2
#> # A tsibble: 4 x 2 [1M]
#>      Month Observation
#>      <mth>       <dbl>
#> 1 2019 Jan          50
#> 2 2019 Feb          23
#> 3 2019 Mar          34
#> 4 2019 Apr          30

Other time class functions can be used depending on the frequency of the observations:

z3  <- z %>% 
  mutate(Month = yearquarter(Month)) %>% 
  as_tsibble(index = Month)

z3
#> # A tsibble: 4 x 2 [1Q]
#>     Month Observation
#>     <qtr>       <dbl>
#> 1 2019 Q1          50
#> 2 2019 Q1          23
#> 3 2019 Q1          34
#> 4 2019 Q2          30

Working with tsibble objects

library(tsibbledata)
data(PBS)
PBS 
#> # A tsibble: 65,219 x 9 [1M]
#> # Key:       Concession, Type, ATC1, ATC2 [336]
#>         Month Concession  Type   ATC1  ATC1_desc  ATC2  ATC2_desc  Scripts  Cost
#>         <mth> <chr>       <chr>  <chr> <chr>      <chr> <chr>        <dbl> <dbl>
#>  1   1991 Jul Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   18228 67877
#>  2   1991 Aug Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   15327 57011
#>  3   1991 Sep Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   14775 55020
#>  4   1991 Oct Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   15380 57222
#>  5   1991 Nov Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   14371 52120
#>  6   1991 Dec Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   15028 54299
#>  7   1992 Jan Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   11040 39753
#>  8   1992 Feb Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   15165 54405
#>  9   1992 Mar Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   16898 61108
#> 10   1992 Apr Concession… Co-pa… A     Alimentar… A01   STOMATOLO…   18141 65356
#> # … with 65,209 more rows
PBS %>% 
  filter(ATC2 == "A10") %>% 
  select(Month, Concession, Type, Cost)
#> # A tsibble: 816 x 4 [1M]
#> # Key:       Concession, Type [4]
#>       Month Concession   Type           Cost
#>       <mth> <chr>        <chr>         <dbl>
#>  1 1991 Jul Concessional Co-payments 2092878
#>  2 1991 Aug Concessional Co-payments 1795733
#>  3 1991 Sep Concessional Co-payments 1777231
#>  4 1991 Oct Concessional Co-payments 1848507
#>  5 1991 Nov Concessional Co-payments 1686458
#>  6 1991 Dec Concessional Co-payments 1843079
#>  7 1992 Jan Concessional Co-payments 1564702
#>  8 1992 Feb Concessional Co-payments 1732508
#>  9 1992 Mar Concessional Co-payments 2046102
#> 10 1992 Apr Concessional Co-payments 2225977
#> # … with 806 more rows

# PBS %>% 
#   filter(ATC2 == "A10") %>% 
#   select(Month, Cost)  # invalid as we get duplicate rows for each month (?)
PBS %>%
  filter(ATC2=="A10") %>%
  select(Month, Concession, Type, Cost) %>%
  summarise(TotalC = sum(Cost))
#> # A tsibble: 204 x 2 [1M]
#>       Month  TotalC
#>       <mth>   <dbl>
#>  1 1991 Jul 3526591
#>  2 1991 Aug 3180891
#>  3 1991 Sep 3252221
#>  4 1991 Oct 3611003
#>  5 1991 Nov 3565869
#>  6 1991 Dec 4306371
#>  7 1992 Jan 5088335
#>  8 1992 Feb 2814520
#>  9 1992 Mar 2985811
#> 10 1992 Apr 3204780
#> # … with 194 more rows
PBS %>%
  filter(ATC2=="A10") %>%
  select(Month, Concession, Type, Cost) %>%
  summarise(TotalC = sum(Cost)) %>%
  mutate(Cost = TotalC/1e6)
#> # A tsibble: 204 x 3 [1M]
#>       Month  TotalC  Cost
#>       <mth>   <dbl> <dbl>
#>  1 1991 Jul 3526591  3.53
#>  2 1991 Aug 3180891  3.18
#>  3 1991 Sep 3252221  3.25
#>  4 1991 Oct 3611003  3.61
#>  5 1991 Nov 3565869  3.57
#>  6 1991 Dec 4306371  4.31
#>  7 1992 Jan 5088335  5.09
#>  8 1992 Feb 2814520  2.81
#>  9 1992 Mar 2985811  2.99
#> 10 1992 Apr 3204780  3.20
#> # … with 194 more rows
PBS %>%
  filter(ATC2=="A10") %>%
  select(Month, Concession, Type, Cost) %>%
  summarise(TotalC = sum(Cost)) %>%
  mutate(Cost = TotalC/1e6) -> a10

a10
#> # A tsibble: 204 x 3 [1M]
#>       Month  TotalC  Cost
#>       <mth>   <dbl> <dbl>
#>  1 1991 Jul 3526591  3.53
#>  2 1991 Aug 3180891  3.18
#>  3 1991 Sep 3252221  3.25
#>  4 1991 Oct 3611003  3.61
#>  5 1991 Nov 3565869  3.57
#>  6 1991 Dec 4306371  4.31
#>  7 1992 Jan 5088335  5.09
#>  8 1992 Feb 2814520  2.81
#>  9 1992 Mar 2985811  2.99
#> 10 1992 Apr 3204780  3.20
#> # … with 194 more rows

Read a csv file and convert to a tsibble


# Doesn't work on work laptop due to firewall

# prison <- readr::read_csv("https://OTexts.com/fpp3/extrafiles/prison_population.csv")
# 
# prison <- prison %>%
#   mutate(quarter = yearquarter(date)) %>%
#   select(-date) %>%
#   as_tsibble(key = c(state, gender, legal, indigenous), index = quarter)
# 
# prison

For a tsibble to be valid, it requires a unique index for each combination of keys. The tsibble() or as_tsibble() function will return an error if this is not true.

The seasonal period

Some graphics and some models will use the seasonal period of the data. The seasonal period is the number of observations before the seasonal pattern repeats. In most cases, this will be automatically detected using the time index variable.

For quarterly, monthly and weekly data, there is only one seasonal period — the number of observations within each year. Actually, there are not
52 weeks in a year, but 365.25/7= 52.18 on average, allowing for a leap year every fourth year. Approximating seasonal periods to integers can be useful as many seasonal terms in models only support integer seasonal periods.

If the data is observed more than once per week, then there is often more than one seasonal pattern in the data. For example, data with daily observations might have weekly (period = 7) or annual (period = 365.25) seasonal patterns. The same goes for data observed every minute (hourly, daily, weekly and annual seasonality).

More complicated (and unusual) seasonal patterns can be specified using the period() function in the lubridate package.

2.2 Time plots

For time series data, the obvious graph to start with is a time plot. That is, the observations are plotted against the time of observation, with consecutive observations joined by straight lines. Figure 2.1 below shows the weekly economy passenger load on Ansett Airlines between Australia’s two largest cities.

Let’s start by inspecting the data:

data(ansett)
tail(ansett)
#> # A tsibble: 6 x 4 [1W]
#> # Key:       Airports, Class [1]
#>       Week Airports Class Passengers
#>     <week> <chr>    <chr>      <dbl>
#> 1 1992 W42 SYD-PER  First        234
#> 2 1992 W43 SYD-PER  First        203
#> 3 1992 W44 SYD-PER  First        137
#> 4 1992 W45 SYD-PER  First        161
#> 5 1992 W46 SYD-PER  First        155
#> 6 1992 W47 SYD-PER  First        188
glimpse(ansett)
#> Rows: 7,407
#> Columns: 4
#> Key: Airports, Class [30]
#> $ Week       <week> 1989 W28, 1989 W29, 1989 W30, 1989 W31, 1989 W32, 1989 W3…
#> $ Airports   <chr> "ADL-PER", "ADL-PER", "ADL-PER", "ADL-PER", "ADL-PER", "AD…
#> $ Class      <chr> "Business", "Business", "Business", "Business", "Business"…
#> $ Passengers <dbl> 193, 254, 185, 254, 191, 136, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
summary(ansett)
#>    Week            Airports            Class             Passengers   
#>  NULL:1987 W26   Length:7407        Length:7407        Min.   :    0  
#>  NULL:1989 W14   Class :character   Class :character   1st Qu.:  209  
#>  NULL:1990 W28   Mode  :character   Mode  :character   Median :  558  
#>  NULL:1990 W22                                         Mean   : 2874  
#>  NULL:1991 W38                                         3rd Qu.: 3426  
#>  NULL:1992 W47                                         Max.   :32468

Now we plot:

library(feasts)  # NB need this or get error: the functions for graphics are in the feasts package. So just load the feasts package as well and you can autoplot() a tsibble object.

melsyd_economy <- ansett %>%
  filter(Airports == "MEL-SYD", Class=="Economy")

melsyd_economy %>%
  autoplot(Passengers) +
    labs(title = "Ansett economy class passengers", subtitle = "Melbourne-Sydney") +
    xlab("Year")

We will use the autoplot() command frequently. It automatically produces an appropriate plot of whatever you pass to it in the first argument. In this case, it recognises melsyd_economy as a time series and produces a time plot.

The time plot immediately reveals some interesting features.

Any model will need to take all these features into account in order to effectively forecast the passenger load into the future.

A simpler time series is shown in Figure 2.2, using the a10 data saved earlier:

a10 %>% autoplot(Cost) +
  ggtitle("Antidiabetic drug sales") +
  ylab("$ million") + xlab("Year")

Here, there is a clear and increasing trend. There is also a strong seasonal pattern that increases in size as the level of the series increases. The sudden drop at the start of each year is caused by a government subsidisation scheme that makes it cost-effective for patients to stockpile drugs at the end of the calendar year. Any forecasts of this series would need to capture the seasonal pattern, and the fact that the trend is changing slowly.

2.3 Time series patterns

In describing these time series, we have used words such as “trend” and “seasonal” which need to be defined more carefully.

Trend

Seasonal

A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known period. The monthly sales of antidiabetic drugs (figure 2.2) shows seasonality which is induced partly by the change in the cost of the drugs at the end of the calendar year.

Cyclic

A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. These fluctuations are usually due to economic conditions, and are often related to the “business cycle”. The duration of these fluctuations is usually at least 2 years.

Many people confuse cyclic behaviour with seasonal behaviour, but they are really quite different. If the fluctuations are not of a fixed frequency then they are cyclic; if the frequency is unchanging and associated with some aspect of the calendar, then the pattern is seasonal. In general, the average length of cycles is longer than the length of a seasonal pattern, and the magnitudes of cycles tend to be more variable than the magnitudes of seasonal patterns.

Many time series include trend, cycles and seasonality. When choosing a forecasting method, we will first need to identify the time series patterns in the data, and then choose a method that is able to capture the patterns properly.

The examples in Figure 2.3 show different combinations of the above components:

Fig 2.3: Four exmaples of time series showing different patterns

Fig 2.3: Four exmaples of time series showing different patterns

  1. The monthly housing sales (top left) show strong seasonality within each year, as well as some strong cyclic behaviour with a period of about 6–10 years. There is no apparent trend in the data over this period.

  2. The US treasury bill contracts (top right) show results from the Chicago market for 100 consecutive trading days in 1981. Here there is no seasonality, but an obvious downward trend. Possibly, if we had a much longer series, we would see that this downward trend is actually part of a long cycle, but when viewed over only 100 days it appears to be a trend.

  3. The Australian quarterly electricity production (bottom left) shows a strong increasing trend, with strong seasonality. There is no evidence of any cyclic behaviour here.

  4. The daily change in the Google closing stock price (bottom right) has no trend, seasonality or cyclic behaviour. There are random fluctuations which do not appear to be very predictable, and no strong patterns that would help with developing a forecasting model.

2.4 Seasonal plots

A seasonal plot is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed. An example is given below showing the antidiabetic drug sales.

a10 %>% gg_season(Cost, labels = "both") +
  ylab("$ million") +
  ggtitle("Fig 2.4. Seasonal plot: antidiabetic drug sales")

These are exactly the same data as were shown earlier, but now the data from each season are overlapped. A seasonal plot allows the underlying seasonal pattern to be seen more clearly, and is especially useful in identifying years in which the pattern changes.

In this case, it is clear that there is a large jump in sales in January each year. Actually, these are probably sales in late December as customers stockpile before the end of the calendar year, but the sales are not registered with the government until a week or two later. The graph also shows that there was an unusually small number of sales in March 2008 (most other years show an increase between February and March). The small number of sales in June 2008 is probably due to incomplete counting of sales at the time the data were collected.

Multiple seasonal periods

Where the data has more than one seasonal pattern, the period argument can be used to select which seasonal plot is required. The vic_elec data contains half-hourly electricity demand for the state of Victoria, Australia. We can plot the daily pattern, weekly pattern or yearly pattern as follows.

vic_elec %>% gg_season(Demand, period="day") + theme(legend.position = "none")

vic_elec %>% gg_season(Demand, period="week") # + theme(legend.position = "none")

vic_elec %>% gg_season(Demand, period="year")

2.5 Seasonal subseries plots

An alternative plot that emphasises the seasonal patterns is where the data for each season are collected together in separate mini time plots.

a10 %>%
  gg_subseries(Cost) +
    ylab("$ million") +
    xlab("Year") +
    ggtitle("Seasonal subseries plot: monthly antidiabetic drug sales in Australia")

The blue horizontal lines indicate the means for each month. This form of plot enables the underlying seasonal pattern to be seen clearly, and also shows the changes in seasonality over time. It is especially useful in identifying changes within particular seasons. In this example, the plot is not particularly revealing; but in some cases, this is the most useful way of viewing seasonal changes over time.

Example: Australian holiday tourism

Australian quarterly vacation data provides an interesting example of how these plots can reveal information. First we need to extract the relevant data from the tourism tsibble. All the usual tidyverse wrangling verbs apply. To get the total visitor nights spent on Holiday by State for each quarter (i.e., ignoring Regions) we can use the following code. Note that we do not have to explicitly group by the time index as this is assumed in a tsibble.

head(tourism)
#> # A tsibble: 6 x 5 [1Q]
#> # Key:       Region, State, Purpose [1]
#>   Quarter Region   State           Purpose  Trips
#>     <qtr> <chr>    <chr>           <chr>    <dbl>
#> 1 1998 Q1 Adelaide South Australia Business  135.
#> 2 1998 Q2 Adelaide South Australia Business  110.
#> 3 1998 Q3 Adelaide South Australia Business  166.
#> 4 1998 Q4 Adelaide South Australia Business  127.
#> 5 1999 Q1 Adelaide South Australia Business  137.
#> 6 1999 Q2 Adelaide South Australia Business  200.
holidays <- tourism %>%
  filter(Purpose == "Holiday") %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips))

holidays
#> # A tsibble: 640 x 3 [1Q]
#> # Key:       State [8]
#>    State Quarter Trips
#>    <chr>   <qtr> <dbl>
#>  1 ACT   1998 Q1  196.
#>  2 ACT   1998 Q2  127.
#>  3 ACT   1998 Q3  111.
#>  4 ACT   1998 Q4  170.
#>  5 ACT   1999 Q1  108.
#>  6 ACT   1999 Q2  125.
#>  7 ACT   1999 Q3  178.
#>  8 ACT   1999 Q4  218.
#>  9 ACT   2000 Q1  158.
#> 10 ACT   2000 Q2  155.
#> # … with 630 more rows

Time plots of each series shows that there is strong seasonality for most states, but that the seasonal peaks do not coincide.

holidays %>% autoplot(Trips) +
  ylab("thousands of trips") + xlab("Year") +
  ggtitle("Australian domestic holiday nights")

To see the timing of the seasonal peaks in each state, we can use a season plot.

holidays %>% gg_season(Trips) +
  ylab("thousands of trips") +
  ggtitle("Season plots of Australian domestic holidays by state")

Here it is clear that the southern states of Australia (Tasmania, Victoria and South Australia) have strongest tourism in Q1 (their summer), while the northern states (Queensland and the Northern Territory) have the strongest tourism in Q3 (their dry season).

The corresponding subseries plots are shown in Figure 2.8.

holidays %>%
  gg_subseries(Trips) + ylab("thousands of trips") +
  ggtitle("Fig 2.8. Australian domestic holidays by state")

This figure makes it evident that Western Australian tourism has jumped markedly in recent years, while Victorian tourism has increased in Q1 and Q4 but not in the middle of the year.

2.6 Scatterplots

The graphs discussed so far are useful for visualising individual time series. It is also useful to explore relationships between time series.

Figures 2.9 and 2.10 shows two time series: half-hourly electricity demand (in Gigawatts) and temperature (in degrees Celsius), for 2014 in Victoria, Australia. The temperatures are for Melbourne, the largest city in Victoria, while the demand values are for the entire state.

glimpse(vic_elec)
#> Rows: 52,608
#> Columns: 5
#> $ Time        <dttm> 2012-01-01 00:00:00, 2012-01-01 00:30:00, 2012-01-01 01:…
#> $ Demand      <dbl> 4263.366, 4048.966, 3877.563, 4036.230, 3865.597, 3694.09…
#> $ Temperature <dbl> 21.05, 20.70, 20.55, 20.40, 20.25, 20.10, 19.60, 19.10, 1…
#> $ Date        <date> 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-…
#> $ Holiday     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
vic_elec %>%
  filter(lubridate::year(Time) == 2014) %>%
  autoplot(Demand) +
    xlab("Year: 2014") + ylab(NULL) +
    ggtitle("Fig 2.9: Half-hourly electricity demand in Victoria, Australia in 2014")

vic_elec %>%
  filter(lubridate::year(Time) == 2014) %>%
  autoplot(Temperature) +
    xlab("Year: 2014") + ylab(NULL) +
    ggtitle("Fig 2.10: Half-hourly temperatures in Melbourne, Australia in 2014")

We can study the relationship between demand and temperature by plotting one series against the other.

vic_elec %>%
  filter(lubridate::year(Time) == 2014) %>%
  ggplot(aes(x = Temperature, y = Demand)) +
    geom_point() +
    ylab("Demand (GW)") + xlab("Temperature (Celsius)") +
  ggtitle("Fig 2.11: Temp vs Demand")

This scatterplot helps us to visualise the relationship between the variables. It is clear that high demand occurs when temperatures are high due to the effect of air-conditioning. But there is also a heating effect, where demand increases for very low temperatures.

Correlation

It is common to compute correlation coefficients to measure the strength of the relationship between two variables. The correlation between variables x and y is given by the following equation:

Eqn 1: Correlation coefficient

Eqn 1: Correlation coefficient

The value of r always lies between − 1 and 1 with negative values indicating a negative relationship and positive values indicating a positive relationship. The graphs in Figure 2.12 show examples of data sets with varying levels of correlation.

Fig 2.12

Fig 2.12

The correlation coefficient only measures the strength of the linear relationship, and can sometimes be misleading. For example, the correlation for the electricity demand and temperature data shown in Figure 2.11 is 0.28, but the non-linear relationship is stronger than that.

The plots in Figure 2.13 all have correlation coefficients of 0.82, but they have very different relationships. This shows how important it is to look at the plots of the data and not simply rely on correlation values.

Fig 2.13

Fig 2.13

Scatterplot matrices

When there are several potential predictor variables, it is useful to plot each variable against each other variable. Consider the eight time series shown in Figure 2.14, showing quarterly visitor numbers across states and territories of Australia.

visitors <- tourism %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips))

visitors
#> # A tsibble: 640 x 3 [1Q]
#> # Key:       State [8]
#>    State Quarter Trips
#>    <chr>   <qtr> <dbl>
#>  1 ACT   1998 Q1  551.
#>  2 ACT   1998 Q2  416.
#>  3 ACT   1998 Q3  436.
#>  4 ACT   1998 Q4  450.
#>  5 ACT   1999 Q1  379.
#>  6 ACT   1999 Q2  558.
#>  7 ACT   1999 Q3  449.
#>  8 ACT   1999 Q4  595.
#>  9 ACT   2000 Q1  600.
#> 10 ACT   2000 Q2  557.
#> # … with 630 more rows
visitors %>%
  ggplot(aes(x = Quarter, y = Trips)) +
    geom_line() +
    facet_grid(vars(State), scales = "free_y") +
    ylab("Number of visitor nights each quarter (millions)") +
  ggtitle("Fig 2.14: Quaterly visitor nights for the states and territories of Australia")

To see the relationships between these eight time series, we can plot each time series against the others. These plots can be arranged in a scatterplot matrix, as shown below (This plot requires the GGally package to be installed.)

visitors %>%
  spread(State, Trips) %>%
  GGally::ggpairs(columns = 2:9)

Note that each point on the scatterplots refers to a given quarter of a year.

For each panel, the variable on the vertical axis is given by the variable name in that row, and the variable on the horizontal axis is given by the variable name in that column. There are many options available to produce different plots within each panel. In the default version, the correlations are shown in the upper right half of the plot, while the scatterplots are shown in the lower half. On the diagonal are shown density plots.

The value of the scatterplot matrix is that it enables a quick view of the relationships between all pairs of variables. In this example, mostly positive relationships are revealed, with the strongest relationships being between the neighboring states located in the south and south east coast of Australia, namely, New South Wales, Victoria and South Australia. Some negative relationships are also revealed between the Northern Territory and other regions. The Northern Territory is located in the north of Australia famous for its outback desert landscapes visited mostly in winter. Hence, the peak visitation in the Northern Territory is in the July (winter) quarter in contrast to January (summer) quarter for the rest of the regions.

2.7 Lag plots

Figure 2.16 displays scatterplots of quarterly Australian beer production (we introduced in Figure 1.1), where the horizontal axis shows lagged values of the time series. Each graph shows \(y_{t}\) with plotted against \(y_{t-k}\) for different values of \(k\).

recent_production <- aus_production %>%
   filter(lubridate::year(Quarter) >= 1992)

recent_production %>% 
  gg_lag(Beer, geom="point") +
  ggtitle("Fig 2.16:  Lagged scatterplots for quarterly beer production")

Here the colours indicate the quarter of the variable on the vertical axis. The relationship is strongly positive at lags 4 and 8, reflecting the strong seasonality in the data. The negative relationship seen for lags 2 and 6 occurs because peaks (in Q4) are plotted against troughs (in Q2)

The filter() function used here is very useful when extracting a portion of a time series. In this case, we have extracted the data from aus_production, beginning in 1992.

2.8 Autocorrelation

Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.

There are several autocorrelation coefficients, corresponding to each panel in the lag plot. For example, \(r1\) measures the relationship between \(y_{t}\) and \(y_{t-1}\), \(r2\) measures the relationship between \(y_{t}\) and \(y_{t-2}\) and so on.

The value of \(r_{k}\) can be written as:

Where \(T\) is the length of the time series. The autocorrelation coefficients make up the autocorrelation function or ACF.

The autocorrelation coefficients for the beer production data can be computed using the ACF() function.

recent_production %>% ACF(Beer, lag_max = 9)
#> # A tsibble: 9 x 2 [1Q]
#>     lag     acf
#>   <lag>   <dbl>
#> 1    1Q -0.102 
#> 2    2Q -0.657 
#> 3    3Q -0.0603
#> 4    4Q  0.869 
#> 5    5Q -0.0892
#> 6    6Q -0.635 
#> 7    7Q -0.0542
#> 8    8Q  0.832 
#> 9    9Q -0.108

The values in the acf column are \(r1,...,r9\), corresponding to the nine scatterplots in Figure 2.16. We usually plot the ACF to see how the correlations change with the lag \(k\). The plot is sometimes known as a correlogram.

recent_production %>% 
  ACF(Beer) %>% 
  autoplot() +
  ggtitle("Figure 2.17: Autocorrelation function of quarterly beer production.")

In this graph:

Trend and seasonality in ACF plots

When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in size. So the ACF of trended time series tend to have positive values that slowly decrease as the lags increase.

When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal frequency) than for other lags.

When data are both trended and seasonal, you see a combination of these effects. The a10 data plotted in Figure 2.2 shows both trend and seasonality. Its ACF is shown in Figure 2.18.

a10 %>% 
  ACF(Cost, lag_max = 48) %>% 
  autoplot() +
  ggtitle("Figure 2.18: ACF of monthly Australian electricity demand.")

The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due the seasonality.

2.9 White noise

Time series that show no autocorrelation are called white noise. Figure 2.19 gives an example of a white noise series.

set.seed(30)
y <- tsibble(sample = 1:50, wn = rnorm(50), index = sample)

y %>% 
  autoplot(wn) + ggtitle("White noise") +
  ggtitle("A white noise time series")

y %>% ACF(wn) %>% 
  autoplot() +
  ggtitle("Figure 2.20: Autocorrelation function for the white noise series")

For white noise series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within \(\mp{2} \div \sqrt{T}\) where